{ "cells": [ { "cell_type": "markdown", "id": "respective-kennedy", "metadata": {}, "source": [ "# Chapter 3: Feature Extraction from Text Data" ] }, { "cell_type": "markdown", "id": "floating-dividend", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "subsequent-forwarding", "metadata": {}, "source": [ "Feature extraction is a pivotal step in the text mining process. Essentially, it translates textual data into a numerical form so that machine learning models can understand. It is the bedrock of many natural language processing tasks. Sklearn provides a suite of tools to efficiently transform text data into a format suitable for machine learning. Through African-context examples, we've witnessed the versatility and applicability of these tools across various textual scenarios. As we venture into more advanced topics, mastering the basics of feature extraction remains paramount.\n", "\n", "This chapter offers an exploration into sklearn's text feature extraction techniques." ] }, { "cell_type": "markdown", "id": "advance-marks", "metadata": {}, "source": [ "**Learning Objectives:**" ] }, { "cell_type": "markdown", "id": "visible-colony", "metadata": {}, "source": [ "* **Understand Basic Text Representation:** Comprehend the necessity of converting textual data into numerical format for machine learning applications, and appreciate the significance of feature extraction in text mining.\n", "\n", "* **Master CountVectorizer:** Confidently utilize the CountVectorizer method to transform text documents into a matrix of token counts, distinguishing how individual words and tokens are represented in this format.\n", "\n", "* **Differentiate Vectorization Techniques:** Discern the differences between TfidfVectorizer and the combination of CountVectorizer with TfidfTransformer. Know when to apply each method based on the task at hand." ] }, { "cell_type": "markdown", "id": "boring-photograph", "metadata": {}, "source": [ "## Understanding Document-Term Matrix (DTM)" ] }, { "cell_type": "markdown", "id": "hungry-arthur", "metadata": {}, "source": [ "\n", "The Document-Term Matrix (DTM) is a matrix representation of the text dataset where each row corresponds to a document, and each column represents a term (typically a word), and each cell contains the frequency of the term in the document.\n", "\n", "Consider two sentences:\n", "1. \"I love machine learning.\"\n", "2. \"Learning machine algorithms is fun.\"\n", "\n", "The DTM for these sentences would have a row for each sentence and columns for each unique word.\n" ] }, { "cell_type": "markdown", "id": "following-bailey", "metadata": {}, "source": [ "## CountVectorizer" ] }, { "cell_type": "markdown", "id": "aggressive-commission", "metadata": {}, "source": [ "`CountVectorizer` turns text documents into a matrix of token counts. Each row will represent a document, and each column will represent a token (word), with the value indicating the count of the token in the respective document." ] }, { "cell_type": "code", "execution_count": 2, "id": "robust-coupon", "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "import pandas as pd\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "defensive-baking", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
algorithmsfunislearninglovemachine
0000111
1111101
\n", "
" ], "text/plain": [ " algorithms fun is learning love machine\n", "0 0 0 0 1 1 1\n", "1 1 1 1 1 0 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Constructing a DTM using the above example\n", "\n", "sample_sentences = [\"I love machine learning.\", \"Learning machine algorithms is fun.\"]\n", "\n", "vectorizer = CountVectorizer()\n", "X0 = vectorizer.fit_transform(sample_sentences)\n", "\n", "# Convert to a DataFrame for better visualization\n", "df = pd.DataFrame(X0.toarray(), columns=vectorizer.get_feature_names_out())\n", "df" ] }, { "cell_type": "code", "execution_count": 4, "id": "beginning-clock", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bustlingcairocapitalcityegyptheartiniskenyalagosnairobinigeriaofthe
000100001101011
110010011010100
201001101000011
\n", "
" ], "text/plain": [ " bustling cairo capital city egypt heart in is kenya lagos \\\n", "0 0 0 1 0 0 0 0 1 1 0 \n", "1 1 0 0 1 0 0 1 1 0 1 \n", "2 0 1 0 0 1 1 0 1 0 0 \n", "\n", " nairobi nigeria of the \n", "0 1 0 1 1 \n", "1 0 1 0 0 \n", "2 0 0 1 1 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs_1 = [\"Nairobi is the capital of Kenya.\", \n", " \"Lagos is a bustling city in Nigeria.\", \n", " \"Cairo is the heart of Egypt.\"]\n", "\n", "vectorizer = CountVectorizer()\n", "vectorizer_2 = CountVectorizer(stop_words='english')\n", "X1= vectorizer.fit_transform(docs_1)\n", "\n", "# Convert to a DataFrame for better visualization\n", "capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer.get_feature_names_out())\n", "capitals_df\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "planned-lambda", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bustlingcairocapitalcityegyptheartkenyalagosnairobinigeria
00010001010
11001000101
20100110000
\n", "
" ], "text/plain": [ " bustling cairo capital city egypt heart kenya lagos nairobi \\\n", "0 0 0 1 0 0 0 1 0 1 \n", "1 1 0 0 1 0 0 0 1 0 \n", "2 0 1 0 0 1 1 0 0 0 \n", "\n", " nigeria \n", "0 0 \n", "1 1 \n", "2 0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer_2 = CountVectorizer(stop_words='english')\n", "X1= vectorizer_2.fit_transform(docs_1)\n", "\n", "# Convert to a DataFrame for better visualization\n", "capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer_2.get_feature_names_out())\n", "capitals_df" ] }, { "cell_type": "code", "execution_count": 6, "id": "adequate-forth", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['african' 'and' 'are' 'is' 'lessons' 'life' 'offer' 'proverbs' 'sayings'\n", " 'wealth' 'wisdom' 'wise']\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africanandareislessonslifeofferproverbssayingswealthwisdomwise
0101000011001
1010011110010
2000100000110
\n", "
" ], "text/plain": [ " african and are is lessons life offer proverbs sayings wealth \\\n", "0 1 0 1 0 0 0 0 1 1 0 \n", "1 0 1 0 0 1 1 1 1 0 0 \n", "2 0 0 0 1 0 0 0 0 0 1 \n", "\n", " wisdom wise \n", "0 0 1 \n", "1 1 0 \n", "2 1 0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs_2 = [\"African proverbs are wise sayings.\", \n", " \"Proverbs offer wisdom and life lessons.\", \n", " \"Wisdom is wealth.\"]\n", "\n", "X2 = vectorizer.fit_transform(docs_2)\n", "print(vectorizer.get_feature_names_out())\n", "capitals_df = pd.DataFrame(X2.toarray(), columns=vectorizer.get_feature_names_out())\n", "capitals_df\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "intelligent-drill", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2nd54africacontinentcountrieshasinisitkilimanjarolargestmountaintallestthe
001101100000000
110010001101001
200100011010111
\n", "
" ], "text/plain": [ " 2nd 54 africa continent countries has in is it kilimanjaro \\\n", "0 0 1 1 0 1 1 0 0 0 0 \n", "1 1 0 0 1 0 0 0 1 1 0 \n", "2 0 0 1 0 0 0 1 1 0 1 \n", "\n", " largest mountain tallest the \n", "0 0 0 0 0 \n", "1 1 0 0 1 \n", "2 0 1 1 1 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Token Patterns (extracting only words without numbers)\n", "\n", "vectorizer_3 = CountVectorizer(token_pattern=r'\\b\\w+\\b')\n", "docs_3 = [\"Africa has 54 countries.\", \n", " \"It is the 2nd largest continent.\", \n", " \"Kilimanjaro is the tallest mountain in Africa.\"]\n", "\n", "X3 = vectorizer_3.fit_transform(docs_3)\n", "\n", "facts_df = pd.DataFrame(X3.toarray(), columns=vectorizer_3.get_feature_names_out())\n", "facts_df\n" ] }, { "cell_type": "markdown", "id": "seeing-project", "metadata": {}, "source": [ "## TfidfVectorizer" ] }, { "cell_type": "markdown", "id": "tired-inside", "metadata": {}, "source": [ "`TfidfVectorizer` converts text documents into a matrix of token counts and then transforms this count matrix into a tf-idf representation. Tf-idf stands for \"Term Frequency-Inverse Document Frequency\". It's a way to score the importance of words (tokens) in the document based on how frequently they appear across multiple documents." ] }, { "cell_type": "code", "execution_count": 8, "id": "studied-tuner", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africaandchangeequalityforfoughtfreedominleaderleadershipmandelanelsonprofoundsawsouthunderwas
00.3241240.0000000.0000000.0000000.0000000.0000000.0000000.4261840.4261840.0000000.2517110.4261840.0000000.0000000.3241240.0000000.426184
10.0000000.4323850.0000000.4323850.4323850.4323850.4323850.0000000.0000000.0000000.2553740.0000000.0000000.0000000.0000000.0000000.000000
20.2981740.0000000.3920630.0000000.0000000.0000000.0000000.0000000.0000000.3920630.2315590.0000000.3920630.3920630.2981740.3920630.000000
\n", "
" ], "text/plain": [ " africa and change equality for fought freedom \\\n", "0 0.324124 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "1 0.000000 0.432385 0.000000 0.432385 0.432385 0.432385 0.432385 \n", "2 0.298174 0.000000 0.392063 0.000000 0.000000 0.000000 0.000000 \n", "\n", " in leader leadership mandela nelson profound saw \\\n", "0 0.426184 0.426184 0.000000 0.251711 0.426184 0.000000 0.000000 \n", "1 0.000000 0.000000 0.000000 0.255374 0.000000 0.000000 0.000000 \n", "2 0.000000 0.000000 0.392063 0.231559 0.000000 0.392063 0.392063 \n", "\n", " south under was \n", "0 0.324124 0.000000 0.426184 \n", "1 0.000000 0.000000 0.000000 \n", "2 0.298174 0.392063 0.000000 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ " # Basic Tf-idf Scores\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "docs_4 = [\"Nelson Mandela was a leader in South Africa.\", \n", " \"Mandela fought for freedom and equality.\", \n", " \"South Africa saw profound change under Mandela's leadership.\"]\n", "\n", "\n", "vectorizer_4 = TfidfVectorizer()\n", "\n", "X4 = vectorizer_4.fit_transform(docs_4)\n", "\n", "sa_facts = pd.DataFrame(X4.toarray(), columns=vectorizer_4.get_feature_names_out())\n", "sa_facts\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "severe-external", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africaandanimalsarebigbothcatscheetahselephantsfastestforfoundinknownlandlargelionsthetheirtusks
00.2674850.3517110.0000000.2077260.3517110.3517110.3517110.2674850.0000000.0000000.0000000.3517110.2674850.0000000.0000000.0000000.3517110.0000000.0000000.000000
10.2776010.0000000.0000000.2155820.0000000.0000000.0000000.0000000.3650110.0000000.3650110.0000000.2776010.3650110.0000000.3650110.0000000.0000000.3650110.365011
20.0000000.0000000.4505040.2660750.0000000.0000000.0000000.3426200.0000000.4505040.0000000.0000000.0000000.0000000.4505040.0000000.0000000.4505040.0000000.000000
\n", "
" ], "text/plain": [ " africa and animals are big both cats \\\n", "0 0.267485 0.351711 0.000000 0.207726 0.351711 0.351711 0.351711 \n", "1 0.277601 0.000000 0.000000 0.215582 0.000000 0.000000 0.000000 \n", "2 0.000000 0.000000 0.450504 0.266075 0.000000 0.000000 0.000000 \n", "\n", " cheetahs elephants fastest for found in known \\\n", "0 0.267485 0.000000 0.000000 0.000000 0.351711 0.267485 0.000000 \n", "1 0.000000 0.365011 0.000000 0.365011 0.000000 0.277601 0.365011 \n", "2 0.342620 0.000000 0.450504 0.000000 0.000000 0.000000 0.000000 \n", "\n", " land large lions the their tusks \n", "0 0.000000 0.000000 0.351711 0.000000 0.000000 0.000000 \n", "1 0.000000 0.365011 0.000000 0.000000 0.365011 0.365011 \n", "2 0.450504 0.000000 0.000000 0.450504 0.000000 0.000000 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs_5 = [\"Lions and cheetahs are both big cats found in Africa.\", \n", " \"Elephants in Africa are known for their large tusks.\", \n", " \"Cheetahs are the fastest land animals.\"]\n", "\n", "X5 = vectorizer_4.fit_transform(docs_5)\n", "\n", "\n", "sa_facts = pd.DataFrame(X5.toarray(), columns=vectorizer_4.get_feature_names_out())\n", "sa_facts\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "coated-assistant", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africanafrican countriesandand diversecongocongo rainforestcountriescutscuts throughdesert...several africanthethe congothe nilethe saharathroughthrough severalvastvast andvast desert
00.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.375716...0.0000000.2219040.0000000.0000000.3757160.0000000.0000000.2857420.0000000.375716
10.2845690.2845690.0000000.0000000.0000000.0000000.2845690.2845690.2845690.000000...0.2845690.1680710.0000000.2845690.0000000.2845690.2845690.0000000.0000000.000000
20.0000000.0000000.3003660.3003660.3003660.3003660.0000000.0000000.0000000.000000...0.0000000.1774010.3003660.0000000.0000000.0000000.0000000.2284360.3003660.000000
\n", "

3 rows × 30 columns

\n", "
" ], "text/plain": [ " african african countries and and diverse congo \\\n", "0 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "1 0.284569 0.284569 0.000000 0.000000 0.000000 \n", "2 0.000000 0.000000 0.300366 0.300366 0.300366 \n", "\n", " congo rainforest countries cuts cuts through desert ... \\\n", "0 0.000000 0.000000 0.000000 0.000000 0.375716 ... \n", "1 0.000000 0.284569 0.284569 0.284569 0.000000 ... \n", "2 0.300366 0.000000 0.000000 0.000000 0.000000 ... \n", "\n", " several african the the congo the nile the sahara through \\\n", "0 0.000000 0.221904 0.000000 0.000000 0.375716 0.000000 \n", "1 0.284569 0.168071 0.000000 0.284569 0.000000 0.284569 \n", "2 0.000000 0.177401 0.300366 0.000000 0.000000 0.000000 \n", "\n", " through several vast vast and vast desert \n", "0 0.000000 0.285742 0.000000 0.375716 \n", "1 0.284569 0.000000 0.000000 0.000000 \n", "2 0.000000 0.228436 0.300366 0.000000 \n", "\n", "[3 rows x 30 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer_6 = TfidfVectorizer(ngram_range=(1,2))\n", "\n", "docs_6 = [\"The Sahara is a vast desert.\", \n", " \"The Nile cuts through several African countries.\", \n", " \"The Congo rainforest is vast and diverse.\"]\n", "\n", "X6 = vectorizer_6.fit_transform(docs_6)\n", "\n", "rivers_df = pd.DataFrame(X6.toarray(), columns=vectorizer_6.get_feature_names_out())\n", "rivers_df" ] }, { "cell_type": "markdown", "id": "manual-slide", "metadata": {}, "source": [ "## TfidfTransformer" ] }, { "cell_type": "markdown", "id": "twelve-stanley", "metadata": {}, "source": [ "While `TfidfVectorizer` takes in raw text and produces tf-idf scores, `TfidfTransformer` is used after `CountVectorizer` to convert the count matrix into a tf-idf representation." ] }, { "cell_type": "code", "execution_count": 11, "id": "first-wednesday", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
accraforghanagoldhistorichostshubisitsknownofresourcesseveralsitesthe
00.3494980.0000000.3494980.0000000.0000000.0000000.4595480.3494980.0000000.0000000.4595480.0000000.0000000.0000000.459548
10.0000000.4030160.3065040.4030160.0000000.0000000.0000000.3065040.4030160.4030160.0000000.4030160.0000000.0000000.000000
20.3554320.0000000.0000000.0000000.4673510.4673510.0000000.0000000.0000000.0000000.0000000.0000000.4673510.4673510.000000
\n", "
" ], "text/plain": [ " accra for ghana gold historic hosts hub \\\n", "0 0.349498 0.000000 0.349498 0.000000 0.000000 0.000000 0.459548 \n", "1 0.000000 0.403016 0.306504 0.403016 0.000000 0.000000 0.000000 \n", "2 0.355432 0.000000 0.000000 0.000000 0.467351 0.467351 0.000000 \n", "\n", " is its known of resources several sites \\\n", "0 0.349498 0.000000 0.000000 0.459548 0.000000 0.000000 0.000000 \n", "1 0.306504 0.403016 0.403016 0.000000 0.403016 0.000000 0.000000 \n", "2 0.000000 0.000000 0.000000 0.000000 0.000000 0.467351 0.467351 \n", "\n", " the \n", "0 0.459548 \n", "1 0.000000 \n", "2 0.000000 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer\n", "\n", "docs_7 = [\"Accra is the hub of Ghana.\", \n", " \"Ghana is known for its gold resources.\", \n", " \"Accra hosts several historic sites.\"]\n", "\n", "\n", "count_vect = CountVectorizer()\n", "X7_count = count_vect.fit_transform(docs_7)\n", "\n", "tfidf_transformer = TfidfTransformer()\n", "X7_tfidf = tfidf_transformer.fit_transform(X7_count)\n", "\n", "rivers_df = pd.DataFrame(X7_tfidf.toarray(), columns=count_vect.get_feature_names_out())\n", "rivers_df\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "subject-chester", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africabiltongdisheastfoodfrominisjolloforiginatingpopularricesnacksouthstapleugaliwest
00.2357560.000000.3991690.0000000.0000000.000000.3035780.2357560.3991690.000000.3991690.3991690.000000.000000.0000000.0000000.399169
10.2571290.000000.0000000.4353570.4353570.000000.3311000.2571290.0000000.000000.0000000.0000000.000000.000000.4353570.4353570.000000
20.2474330.418940.0000000.0000000.0000000.418940.0000000.2474330.0000000.418940.0000000.0000000.418940.418940.0000000.0000000.000000
\n", "
" ], "text/plain": [ " africa biltong dish east food from in \\\n", "0 0.235756 0.00000 0.399169 0.000000 0.000000 0.00000 0.303578 \n", "1 0.257129 0.00000 0.000000 0.435357 0.435357 0.00000 0.331100 \n", "2 0.247433 0.41894 0.000000 0.000000 0.000000 0.41894 0.000000 \n", "\n", " is jollof originating popular rice snack south \\\n", "0 0.235756 0.399169 0.00000 0.399169 0.399169 0.00000 0.00000 \n", "1 0.257129 0.000000 0.00000 0.000000 0.000000 0.00000 0.00000 \n", "2 0.247433 0.000000 0.41894 0.000000 0.000000 0.41894 0.41894 \n", "\n", " staple ugali west \n", "0 0.000000 0.000000 0.399169 \n", "1 0.435357 0.435357 0.000000 \n", "2 0.000000 0.000000 0.000000 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs_8 = [\"Jollof rice is a popular dish in West Africa.\", \n", " \"Ugali is a staple food in East Africa.\", \n", " \"Biltong is a snack originating from South Africa.\"]\n", "\n", "\n", "X8_count = count_vect.fit_transform(docs_8)\n", "X8_tfidf = tfidf_transformer.fit_transform(X8_count)\n", "\n", "afrofoods_df = pd.DataFrame(X8_tfidf.toarray(), columns=count_vect.get_feature_names_out())\n", "afrofoods_df" ] }, { "cell_type": "code", "execution_count": null, "id": "spoken-cocktail", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 13, "id": "antique-moscow", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
acrossafricanafrobeatsandarecommoncontinentdiversefestivalsgenreshighlifeismusicpopularthe
00.000000.265920.0000000.0000000.0000000.000000.000000.265920.000000.0000000.0000000.265920.2022390.0000000.00000
10.000000.000000.1735950.1735950.1320240.000000.000000.000000.000000.1735950.1735950.000000.0000000.1735950.00000
20.153350.000000.0000000.0000000.1166260.153350.153350.000000.153350.0000000.0000000.000000.1166260.0000000.15335
\n", "
" ], "text/plain": [ " across african afrobeats and are common continent \\\n", "0 0.00000 0.26592 0.000000 0.000000 0.000000 0.00000 0.00000 \n", "1 0.00000 0.00000 0.173595 0.173595 0.132024 0.00000 0.00000 \n", "2 0.15335 0.00000 0.000000 0.000000 0.116626 0.15335 0.15335 \n", "\n", " diverse festivals genres highlife is music popular \\\n", "0 0.26592 0.00000 0.000000 0.000000 0.26592 0.202239 0.000000 \n", "1 0.00000 0.00000 0.173595 0.173595 0.00000 0.000000 0.173595 \n", "2 0.00000 0.15335 0.000000 0.000000 0.00000 0.116626 0.000000 \n", "\n", " the \n", "0 0.00000 \n", "1 0.00000 \n", "2 0.15335 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_transformer_9 = TfidfTransformer(norm='l1')\n", "docs_9 = [\"African music is diverse.\", \n", " \"Afrobeats and Highlife are popular genres.\", \n", " \"Music festivals are common across the continent.\"]\n", "\n", "X9_count = count_vect.fit_transform(docs_9)\n", "X9_tfidf = tfidf_transformer_9.fit_transform(X9_count)\n", "\n", "afrimusic_df = pd.DataFrame(X9_tfidf.toarray(), columns=count_vect.get_feature_names_out())\n", "afrimusic_df\n" ] }, { "cell_type": "markdown", "id": "female-james", "metadata": {}, "source": [ ">#### Task 7: \n" ] }, { "cell_type": "markdown", "id": "collective-wright", "metadata": {}, "source": [ "## Analyzing News Articles on African Youth Unemployment\n" ] }, { "cell_type": "markdown", "id": "conventional-interface", "metadata": {}, "source": [ "You're a sociologist who's investigating the portrayal of youth unemployment in African news media. You've collected several news articles discussing youth unemployment in various African nations. Your aim is to identify the most discussed themes and assess the importance of different terms in the articles using feature extraction methods.\n", "\n", "\n", "1. Load the youth employment articles using the following command `%load `youth_emp_article.py`\n", "\n", "2. Tokenize the articles into individual words.\n", "\n", "3. Use the CountVectorizer to count word occurrences.\n", "\n", "4. Use the TfidfVectorizer to compute the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term.\n", "\n", "5. Alternatively, use the TfidfTransformer to compute TF-IDF values if starting with raw count from CountVectorizer.\n", "\n", "6. Analyze the top terms to understand the main themes in the articles." ] }, { "cell_type": "markdown", "id": "complete-ghana", "metadata": {}, "source": [ ">#### Task 8: \n" ] }, { "cell_type": "markdown", "id": "sunset-fraction", "metadata": {}, "source": [ "## Analyzing Economic Reports on the African Agricultural Export Potential" ] }, { "cell_type": "markdown", "id": "flying-drive", "metadata": {}, "source": [ "You're an economist at the African Union's Department of Economic Affairs. With increasing talks about intra-African trade and global exports, you've gathered several economic reports discussing the potential of African agricultural exports and their economic impact. Your goal is to extract insights about the most emphasized agricultural commodities and understand the most significant themes across the reports using text analysis techniques.\n", "\n", "1. Load the youth economics reports using the following command `%load eco_reports.py`\n", "\n", "2. Tokenize the economic reports into individual words.\n", "\n", "3. Use the CountVectorizer to compute the frequency of word occurrences.\n", "\n", "4. 
{ "cell_type": "markdown", "id": "complete-ghana", "metadata": {}, "source": [ ">#### Task 8: \n" ] }, { "cell_type": "markdown", "id": "sunset-fraction", "metadata": {}, "source": [ "## Analyzing Economic Reports on the African Agricultural Export Potential" ] }, { "cell_type": "markdown", "id": "flying-drive", "metadata": {}, "source": [ "You're an economist at the African Union's Department of Economic Affairs. With increasing talk about intra-African trade and global exports, you've gathered several economic reports discussing the potential of African agricultural exports and their economic impact. Your goal is to extract insights about the most emphasized agricultural commodities and to identify the most significant themes across the reports using text analysis techniques.\n", "\n", "1. Load the economic reports using the command `%load eco_reports.py`.\n", "\n", "2. Tokenize the economic reports into individual words.\n", "\n", "3. Use the CountVectorizer to compute the frequency of word occurrences.\n", "\n", "4. Apply the TfidfVectorizer to determine the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term. Alternatively, if starting with raw counts from CountVectorizer, use the TfidfTransformer to calculate the TF-IDF values.\n", "\n", "5. Evaluate the top terms to decipher the primary commodities and themes in the economic reports. The Task 7 starter sketch above can be adapted directly.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "polar-serial", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 }